Comparisons of statistical distributions for cluster sizes in a developing pandemic

Background We consider cluster size data of SARS-CoV-2 transmissions for a number of different settings from recently published data. The statistical characteristics of superspreading events are commonly described by fitting a negative binomial distribution to secondary infection and cluster size data as an alternative to the Poisson distribution as it is a longer tailed distribution, with emphasis given to the value of the extra parameter which allows the variance to be greater than the mean. Here we investigate whether other long tailed distributions from more general extended Poisson process modelling can better describe the distribution of cluster sizes for SARS-CoV-2 transmissions. Methods We use the extended Poisson process modelling (EPPM) approach with nested sets of models that include the Poisson and negative binomial distributions to assess the adequacy of models based on these standard distributions for the data considered. Results We confirm the inadequacy of the Poisson distribution in most cases, and demonstrate the inadequacy of the negative binomial distribution in some cases. Conclusions The probability of a superspreading event may be underestimated by use of the negative binomial distribution as much larger tail probabilities are indicated by EPPM distributions than negative binomial alternatives. We show that the large shared accommodation, meal and work settings, of the settings considered, have the potential for more severe superspreading events than would be predicted by a negative binomial distribution. Therefore public health efforts to prevent transmission in such settings should be prioritised. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-022-01517-9.


Background
In the public communication of information about the novel coronavirus disease COVID-19 there is emphasis frequently given to estimates of the basic reproduction number R 0 , the expected number of secondary cases arising from a single primary case in a fully susceptible population, and the effective reproduction number R t which is similarly defined to R 0 but when a population is subject to control measures. Values of such R's greater than one lead to the exponential growth of daily case numbers. Of next importance, or perhaps equal, to the value of R 0 or R t is a measure of individual variation in infectiousness. Such variation is important because it can explain the occurrence of so-called superspreading events (SSE's) where the numbers of secondary and subsequent cases are substantially more than expected from assuming a Poisson distribution for the numbers of cases. In a Nature News Feature [1] the importance of SSE's in the continued COVID-19 pandemic is emphasized and relevant research reviewed.
SSE's were considered in [2] by comparing right tail probabilities of the Poisson distribution with alternatives for the distribution of secondary cases. For highly skew negative binomial distributions, with variance equal to mean + mean 2 /k and small values of the extra parameter k, the tail probabilities are much larger than those for the corresponding Poisson distribution with the same mean. The paper emphasises the importance of recognising the reasonable chance of SSE's in the effective control of infectious diseases by targeting individual-specific control measures rather than population wide measures. The importance of SSE's in the COVID-19 pandemic is shown in [3] where the transmission of the virus SARS-CoV-2 is characterised by high stochasticity under low prevalence. The importance of the control of SSE's in the transmission of SARS-CoV-2 is stressed, and various types of SSE are described using negative binomial tail probabilities with small dispersion parameter k to quantify the probability of an SSE.
Genomic epidemiology was used [4] to investigate the transmission of SARS-CoV-2 in the Boston, MA, area concluding that not all SSE's for COVID-19 have the same impact on the community. Two SSE's were compared. One, in a skilled nursing facility led to widespread transmission within the facility, while another, in an international business conference, had a much greater impact on the local, US and international community with a very large number of transmissions. This study does suggest that SSE's can affect in diverse ways the course of an epidemic, and that prevention, detection, and mitigation of such events should be a priority for public health efforts.
Given data on the number of secondary cases arising from each primary case, either individual secondary case data or clusters of such data, then the epidemiological literature has suggested using the negative binomial distribution as an alternative to the Poisson distribution. Cluster size includes all primary or initial cases, secondary cases and all subsequent derived cases. This distribution was considered [2] for the number of secondary cases for a number of infectious diseases including severe acute respiratory syndrome (SARS). A formula was derived for the probability of a given cluster size given a known number of initial cases, and the dispersion parameter k was estimated to be 0.16 with 90% confidence interval (0.11, 0.64) from data on the Singapore SARS outbreak due to the SARS-CoV-1 virus. The negative binomial and geometric distributions were favoured over the Poisson distribution for several disease datasets.
For Middle East respiratory syndrome coronavirus (MERS-CoV) data clusters of cases from around the world were considered in [5], and a negative binomial distribution for the number of secondary cases was fitted and estimates reported for MERS-CoV: k = 0.26 (90% CI: 0.11, 0.87, 95% CI: 0.09, 1.24).
For COVID-19 data sets, the negative binomial distribution was fitted to sizes of outbreaks or clusters outside China in [6]. A dispersion parameter k estimate of 0.1 was reported, with 95% Credible Interval (0.05, 0.2) for R 0 equal to 2.5 (thereby ignoring any uncertainty in the estimation of R 0 ). Joint inference for both R 0 and k does have a large degree of uncertainty, but even so using WBIC it was shown that the negative binomial distribution was greatly favoured over the Poisson distribution.
The zero-truncated negative binomial distribution was fitted to COVID-19 secondary case data in [7], including possible cluster sizes of one; that is, cases with no further infections. The parameter k was estimated to be 0.32, 95% CI (015, 0.64).
Given the common use of the negative binomial distribution to model the distribution of secondary cases, cluster sizes and hence SSE's, it is important and of interest to investigate whether other long tailed distributions can reasonably describe the distribution of secondary cases and cluster sizes for SARS-CoV-2 transmissions and whether these distributions lead to more extreme SSE's than predicted by a negative binomial distribution.

Data
The data analysed in this paper are from an Excel database dated 06-07-2020 (North American style) [8]. They were collected for various "settings" within which infections took place; these ranged from "Building site" to "Work". The database gave "The total number of cases per cluster" for all 265 entries, with breakdowns into numbers of "primary cases" and" secondary cases" only included for less than a third of these. So to maximize the amount of data available for analysis, cluster sizes were considered as the response with settings as a covariate. Transmission of infection was presumed to be limited to the specific setting with no infections from other settings included.
The overall aim in [8] was to gather information on reported clusters of COVID-19 cases to determine the different settings in which SARS-CoV-2 transmission was occurring. This was done from a search of the scientific literature and media articles detailing clusters of SARS-CoV-2 transmission. Data from such sources will have several limitations, including inherent recall bias, biased media reporting and incomplete information -deciding to work with data on cluster sizes being a consequence of the last of these. About half the data related to outbreaks occurring in China and Singapore.

Modelling
The Poisson distribution ( [9], chapter 4) is a basic statistical model for count data and corresponds to "events" occurring randomly over time in a Poisson process of fixed "rate", λ > 0; that is, where the probability of an event occurring in a short time interval of length δt is λδt, independently of the occurrence of any other events. The number of events occurring in a finite time interval of length t (which can be taken, without loss of generality, to be one) then has the Poisson distribution with mean = variance = λ. This limitation on the variation has prompted generalisations to admit variance greater than the mean with the negative binomial ( [9], chapter 5) being one such generalisation where the rate of the Poisson process is allowed to vary between different observed processes according to a gamma distribution with shape parameter k (> 0); this leads to a mixed Poisson distribution with variance = mean + mean 2 /k. Such a measure, k, of extra variation compared to Poisson in any mixed Poisson distribution expressed generally as μ 2 /(σ 2 -μ), where μ and σ 2 are the mean and variance of the counts, is the reciprocal of the square of the coefficient of variation of the mixing distribution. So the smaller this quantity is the greater the extra variation with the Poisson distribution corresponding to an infinite value. An alternative way of varying the rate of the underlying Poisson process is to relax the above assumption of independence and have dependency on the accumulating number of events n, say, occurring: λ n = a(n + k), n ≥ 0, corresponds to the negative binomial distribution with mean k [exp(a) -1]. Having λ n = a(n + b) c with c ≠ 1 thus generalises the negative binomial distribution (c = 1 and b = k) with c < 0 resulting in variance < mean, and c > 0 variance > mean [10]. So this generalisation admits a range of dispersion from sub-Poisson (c < 0) through Poisson (c = 0) to negative binomial (c = 1) and beyond (c > 1). Details of calculating the probabilities from sequences λ n , n ≥ 0, are given in an Additional file 1.
This extended Poisson process modelling (or EPPM) was introduced in [11] with subsequent papers ( [12] and references therein) developing and applying this modelling. A complication arises when the power parameter c in the above formulation of the λ n 's exceeds one, in that the resulting probability distribution is dishonest with the sum of the probabilities being less than one, and they require re-normalising by dividing by this sum before they can be applied -details of doing this in practice are also given in the Additional file 1.
Using this λ n = a(n + b) c , n ≥ 0, formulation a single degree of freedom test for departures from negative binomial variation thus corresponds to testing the hypothesis c = 1 (with b = k). Although c = 0 corresponds to λ n being constant and the Poisson distribution, the parameter b becomes redundant in this special case. However, the limiting case of b → ∞ and c → ±∞ leads to λ n = α exp. (βn) with β = 0 corresponding to the Poisson distribution, and β > 0 (< 0) to probability distributions with variance >(<) mean (again, re-normalisation of the probabilities is required if β > 0). Hence a single degree of freedom test for departures from the Poisson distribution can be carried out.

Model fitting
All the probability distributions considered: (i) basic Poisson distribution with variance = mean, (ii) EPPM with λ n = α exp. (βn), n ≥ 0 (to assess the adequacy of the Poisson distribution, β = 0), (iii) negative binomial distribution (if Poisson is inadequate, to estimate the additional dispersion parameter k), and (iv) EPPM with λ n = a(n + b) c , n ≥ 0 (to assess the adequacy of the negative binomial, c = 1) Have support on 0, 1, 2, … but cluster sizes are by definition non-zero. Distributions (i) -(iv) were therefore fitted to cluster sizes minus one for each setting. Maximum likelihood estimation (again, practical details including MATLAB ® code are in the Additional file 1), rather than simple moment estimation, was used in fitting the distribution and the results compared across the different settings. For each setting, a Poisson model was first fitted to the data and compared with an EPPM (ii) fit using a likelihood ratio test; if there were no significant improvement then the Poisson model was selected. If there were significant improvement, then a negative binomial model was fitted and compared with an EPPM (iv) fit; if this resulted in no further significant improvement in fit then the negative binomial model was chosen. If there were some significant improvement, then the EPPM (iv) model was selected; since this was the most general of the models, assessments of the resulting fits were done using P-P plots.

Results
The data [8] provided ten subsets of specific settings of SARS-CoV-2 transmission, where there were sufficient numbers of clusters (at least 10) for reasonable prospects of getting Sufficiently precise estimates. These were: elderly care (n = 21), food processing plants (n = 21), household (n = 38), large shared accommodation (n = 29), meal (in restaurants, etc., n = 17), party (n = 14), religious (gatherings, n = 22), school (n = 11), sport (n = 22) and work (n = 15). Table 1 shows results from the above fitting process, with the most parsimoniously parameterised model chosen. The P-P plots in Fig. 1 show the improvement in fit of the EPPM (iv) over the negative binomial model for the work data, with the former showing quite a good fit to the data and a better match to the upper tail of the empirical distribution. The models fitted to the large shared accommodation and meal data showed similar P-P plots. Only household transmission could reasonably be described by the Poisson distribution (chi-squared goodness of fit statistic 2.21 on 4 d.f.), with all the other settings showing highly significant departures from Poisson variation (p-values < 0.001). The negative binomial distribution was quite adequate for elderly care, food processing plants, party, religious, school and sport. However, large shared accommodation, meal and work all showed significantly more variation than that described by the negative binomial distribution.
The maximum likelihood estimates of the parameter k from the negative binomial fits to the above mentioned six subsets varied between 0.66 with standard error 0.17 (food processing) and 1.16 with standard error 0.34 Table 1 Details of model fitting for ten different settings for COVID-19 cluster size data [8]. One of three possible models [(i) -(iv), as described in the text] was chosen based on the log-likelihood ratio statistic, the value of which is given for testing a simpler model versus a more complex one [β = 0 for (ii) versus (i), and c = 1 for (iv) versus (iii)]. The p-value of this statistic is also given (elderly care). These relatively large standard errors suggest that this parameter may be assumed to be constant across the subsets, which is confirmed by fitting negative binomial distributions with the same value of the k parameter but different values of the means to these data: log-likelihood ratio statistic 2.75 on 5 d.f. Shown in Table 2 are the estimated means and (common) parameter k.
Estimates of the parameter k for the large shared accommodation, meal and work subsets had to be calculated from first principles from the fitted distributions, as there are no simple algebraic expressions for the moments from the EPPM formulation; this does make likelihood based inferences problematic. The three estimates of k and their standard errors are also shown in Table 2; all three estimates are lower than the (common) estimate of k for the other settings (excluding household). However, the standard errors are very large, suggesting that the three subsets having the same value of k could be a quite acceptable hypothesis, but also that k might not be such a useful measure of extra variation for long-tailed distributions like EPPM's,.
In Fig. 2 are plots of the fitted probability distributions corresponding to the estimated models described in Tables 1 and 2. As the means varied from about four (household) to nearly 200 (food processing), upper tail probabilities have been plotted against multiples of their respective means. The negative binomial distributions (elderly care, food processing. plants, party, religious, school and sport) all with a common value of the parameter k are quite similar showing a roughly linear decline on the log-scale. The EPPM distributions for.
Meal and work with lower values of k are also quite similar but convex decreasing (on the log-scale) indicating the much longer tails of these distributions. While the EPPM distribution for large shared accommodation with an even lower k and also a lower value of the parameter c shows even longer tails. The Poisson (household) distribution shows its much shorter tail with a concave decreasing plot.

Discussion
Maximum likelihood estimation has been used here to estimate the dispersion parameter k, rather than moment estimation which seems to be more common in epidemiological studies [2,[5][6][7]. Indeed, the estimates quoted in Table 2 are somewhat larger than those from these studies, although they are all rather imprecise (with wide confidence intervals or large standard errors). Estimates based on moments, or k-statistics, do have a tendency to be smaller with data from long-tailed probability distributions as simple moments are more influenced by extremes than maximum likelihood estimates.
Even though, as shown in Fig. 2, larger tail probabilities of the negative binomial compared to Poisson are plainly apparent, superspreading may be underestimated by use of the negative binomial distribution [2,3] as much larger tail probabilities are indicated from the EPPM distributions.
Cluster sizes are necessarily positive and have been modelled using truncated (zero excluded) distributions with support on 1, 2, … in [7]. Here, untruncated distributions have been used to model cluster size minus one which has an interpretation as the number of additional infections in a superspreading event initiated by Table 2 Details of parameter estimates with standard errors (s.e.) for ten different settings for COVID-19 cluster size data [8]. Values are given for the dispersion parameter, or k-parameter, defined for all models fitted and the c-parameter for the EPPM models. A value of 0 for the c-parameter corresponds to a Poisson model and a value of 1 a negative binomial model a single infected individual. Both truncated and untruncated modelling gave similar results from the data considered here, with no one type consistently preferred over the other by likelihood based criteria. However, untruncated EPPM's were preferred over truncated for the large shared accommodation, work and meal data. Extreme value like distributions offer other longer-tailed alternatives to the negative binomial distribution, but the theory behind such distributions (i.e., maxima of equal sample sizes) would not apply to epidemic clusters. Pragmatically though, distributions of such forms could be applied to any data but they should be in discrete form for numbers of infections. For example, a discrete Pareto distribution with probability mass function proportional to (n + a) -(b + 1) , for n = 0, 1, 2, … with a, b > 0, gave much the same fits to the large shared accommodation and work data as the EPPM's, but a rather worse fit to the meal data -much the same as the negative binomial in fact. Such distributions are quite different from EPPM's, without the Poisson and negative binomial distributions as special cases. And the EPPM family of distributions does have the appeal of containing these standard distributions as special cases enabling significance testing of improvements from more general alternatives within the family.
Settings of SARS-CoV-2 infection have been the subject of other recent studies. In [13] the authors investigated the risk of transmission in outdoor settings compared with indoor settings, conducting a systematic review of peer-reviewed papers and identifying five studies which found a lower proportion of reported global SARS-CoV-2 infections occurring outdoors: odds-ratio of 18.7 (95% confidence interval 6.0-57.9) for infection indoors compared with outdoors. And in [14] the authors analysed outbreaks by industry sector in the first wave of the pandemic, and associated household cases. They found that 80% of cases belonged to three sectors: manufacturing; agriculture, forestry, fishing and hunting; and transportation and warehousing. Household cases were associated with 31% of outbreak cases, increasing the burden of illness by 56%. The results presented here are not dissimilar in that SSE's were associated with the work and meal settings, the latter being an indoor setting as would be large shared accommodation. And households being associated with increasing the burden of illness would not be at variance with the finding here of household infection not initiating SSE's per se.

Conclusions
Long-tailed probability distributions are necessary to adequately describe the variation in sizes of clusters of infections emanating from a common source, as the Poisson distribution was quite inadequate for all but one of  Tables 1 and 2 for COVID-19 cluster size data [8]. These are of upper tail probabilities of cluster size plotted against multiples of their respective means. Poisson (dotted line), negative binomial (solid lines) and EPPM (dashed lines)